Locality Optimization on a NUMA Architecture for Hybrid LU Factorization

نویسندگان

  • Adrien Rémy
  • Marc Baboulin
  • Masha Sosonkina
  • Brigitte Rozoy
چکیده

We study the impact of non-uniform memory accesses (NUMA) on the solution of dense general linear systems using an LU factorization algorithm. In particular we illustrate how an appropriate placement of the threads and memory on a NUMA architecture can improve the performance of the panel factorization and consequently accelerate the global LU factorization. We apply these placement strategies and present performance results for a hybrid multicore/GPU LU algorithm as it is implemented in the public domain library MAGMA.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Tool Environment for Efficient Execution of Shared Memory Programs on NUMA Systems

One of the most important performance issues on NUMA systems is data locality since remote memory accesses have latencies several magnitudes higher than local memory accesses. This paper presents a tool environment targeting at tuning NUMA-based shared memory applications towards better memory locality. This tool environment comprises tools, supporting system facilities, and their interface. To...

متن کامل

THE USE OF SEMI INHERITED LU FACTORIZATION OF MATRICES IN INTERPOLATION OF DATA

The polynomial interpolation in one dimensional space R is an important method to approximate the functions. The Lagrange and Newton methods are two well known types of interpolations. In this work, we describe the semi inherited interpolation for approximating the values of a function. In this case, the interpolation matrix has the semi inherited LU factorization.

متن کامل

More Data Locality for Static Control Programs on NUMA Architectures

The polyhedral model is powerful for analyzing and transforming static control programs, hence its intensive use for the optimization of data locality and automatic parallelization. Affine transformations excel at modeling control flow, to promote data reuse and to expose parallelism. The approach has also successfully been applied to the optimization of memory accesses (array expansion and con...

متن کامل

Experiments with Cholesky Factorization on Clusters of SMPs

Cholesky factorization of large dense matrices is an integral part of many applications in science and engineering. In this paper we report on experiments with different parallel versions of Cholesky factorization on modern high-performance computing architectures. For the parallelization of Cholesky factorization we utilized various standard linear algebra software packages and present perform...

متن کامل

Topology-Aware Parallelism for NUMA Copying Collectors

NUMA-aware parallel algorithms in runtime systems attempt to improve locality by allocating memory from local NUMA nodes. Researchers have suggested that the garbage collector should profile memory access patterns or use object locality heuristics to determine the target NUMA node before moving an object. However, these solutions are costly when applied to every live object in the reference gra...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013